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Abstract 

Ideas  from  the  image  processing  literature  have  recently  motivated  a  new  set  of 
clustering  algorithms  that  rely  on  the  concept  of  total  variation.  While  these  al¬ 
gorithms  perform  well  for  bi-partitioning  tasks,  their  recursive  extensions  yield 
unimpressive  results  for  multiclass  clustering  tasks.  This  paper  presents  a  general 
framework  for  multiclass  total  variation  clustering  that  does  not  rely  on  recursion. 
The  results  greatly  outperform  previous  total  variation  algorithms  and  compare 
well  with  state-of-the-art  NMF  approaches. 


1  Introduction 

Many  clustering  models  rely  on  the  minimization  of  an  energy  over  possible  partitions  of  the  data 
set.  These  discrete  optimizations  usually  pose  NP-hard  problems,  however.  A  natural  resolution 
of  this  issue  involves  relaxing  the  discrete  minimization  space  into  a  continuous  one  to  obtain  an 
easier  minimization  procedure.  Many  current  algorithms,  such  as  spectral  clustering  methods  or 
non-negative  matrix  factorization  (NMF)  methods,  follow  this  relaxation  approach. 

A  fundamental  problem  arises  when  using  this  approach,  however;  in  general  the  solution  of  the 
relaxed  continuous  problem  and  that  of  the  discrete  NP-hard  problem  can  differ  substantially.  In 
other  words,  the  relaxation  is  too  loose.  A  tight  relaxation,  on  the  other  hand,  has  a  solution  that 
closely  matches  the  solution  of  the  original  discrete  NP-hard  problem.  Ideas  from  the  image  pro¬ 
cessing  literature  have  recently  motivated  a  new  set  of  algorithms  [17,  18,  11,  12,  4,  15,  3,  2,  13,  10] 
that  can  obtain  tighter  relaxations  than  those  used  by  NMF  and  spectral  clustering.  These  new  algo¬ 
rithms  all  rely  on  the  concept  of  total  variation.  Total  variation  techniques  promote  the  formation  of 
sharp  indicator  functions  in  the  continuous  relaxation.  These  functions  equal  one  on  a  subset  of  the 
graph,  zero  elsewhere  and  exhibit  a  non-smooth  jump  between  these  two  regions.  In  contrast  to  the 
relaxations  employed  by  spectral  clustering  and  NMF,  total  variation  techniques  therefore  lead  to 
quasi-discrete  solutions  that  closely  resemble  the  discrete  solution  of  the  original  NP-hard  problem. 
They  provide  a  promising  set  of  clustering  tools  for  precisely  this  reason. 

Previous  total  variation  algorithms  obtain  excellent  results  for  two  class  partitioning  problems 
[18,  11,  12,  3]  .  Until  now,  total  variation  techniques  have  relied  upon  a  recursive  bi-partitioning 
procedure  to  handle  more  than  two  classes.  Unfortunately,  these  recursive  extensions  have  yet  to 
produce  state-of-the-art  results.  This  paper  presents  a  general  framework  for  multiclass  total  varia¬ 
tion  clustering  that  does  not  rely  on  a  recursive  procedure.  Specifically,  we  introduce  a  new  discrete 
multiclass  clustering  model,  its  corresponding  continuous  relaxation  and  a  new  algorithm  for  opti¬ 
mizing  the  relaxation.  Our  approach  also  easily  adapts  to  handle  either  unsupervised  or  transductive 
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clustering  tasks.  The  results  significantly  outperform  previous  total  variation  algorithms  and  com¬ 
pare  well  against  state-of-the-art  approaches  [19,  20,  1],  We  name  our  approach  Multiclass  Total 
Variation  clustering  (MTV  clustering). 


2  The  Multiclass  Balanced- Cut  Model 


Given  a  weighted  graph  G  =  ( V.  W )  we  let  V  =  {xi, . . . ,  xy }  denote  the  vertex  set  and  W  := 
{u>ij}i<i,j<N  denote  the  non-negative,  symmetric  similarity  matrix.  Each  entry  iutj  of  W  encodes 
the  similarity,  or  lack  thereof,  between  a  pair  of  vertices.  The  classical  balanced-cut  (or,  Cheeger 
cut)  [7,  8]  asks  for  a  partition  of  V  =  A  U  Ac  into  two  disjoint  sets  that  minimizes  the  set  energy 


Bal(A)  := 


Cut  (A,AC) 
min{|j4|,  |AC|} 


SxiGA,Xj6Ac  wij 
min{|  A|,  |AC|} 


(1) 


A  simple  rationale  motivates  this  model:  clusters  should  exhibit  similarity  between  data  points, 
which  is  reflected  by  small  values  of  Cut  (A.  Ac),  and  also  form  an  approximately  equal  sized  parti¬ 
tion  of  the  vertex  set.  Note  that  min{|A|,  |AC|}  attains  its  maximum  when  \A\  =  \AC\  =  ./V/2,sothat 
for  a  given  value  of  Cut(A,  A r )  the  minimum  occurs  when  A  and  A'  have  approximately  equal  size. 

We  generalize  this  model  to  the  multiclass  setting  by  pursuing  the  same  rationale.  For  a  given 
number  of  classes  R  (that  we  assume  to  be  known)  we  formulate  our  generalized  balanced-cut 
problem  as 


R 

Minimize 

r—1 


Cut  (Ar,  A£) 
min{A|Ar|,  \Acr\} 


over  all  disjoint  partitions  Ar  n  As  =  0,  A\  U  •  •  •  U  Ar  =  V  of  the  vertex  set. 


(P) 


In  this  model  the  parameter  A  controls  the  sizes  of  the  sets  Ar  in  the  partition.  Previous  work  [4] 
has  used  A  =  1  to  obtain  a  multiclass  energy  by  a  straightforward  sum  of  the  two-class  balanced-cut 
terms  (1).  While  this  follows  the  usual  practice,  it  erroneously  attempts  to  enforce  that  each  set  in 
the  partition  occupy  half  of  the  total  number  of  vertices  in  the  graph.  We  instead  select  the  parameter 
A  to  ensure  that  each  of  the  classes  approximately  occupy  the  appropriate  fraction  1/i?  of  the  total 
number  of  vertices.  As  the  maximum  of  min{A|Ar|,  |A£|}  occurs  when  A|Ar|  =  |A£|  =  N  —  \Ar\, 
we  see  that  A  =  R  —  1  is  the  proper  choice. 

This  general  framework  also  easily  incorporates  a  priori  known  information,  such  as  a  set  of  labels 
for  transductive  learning.  If  Lr  C  V  denotes  a  set  of  data  points  that  are  a  priori  known  to  belong 
to  class  r  then  we  simply  enforce  Lr  C  A  r  in  the  definition  of  an  allowable  partition  of  the  vertex 
set.  In  other  words,  any  allowable  disjoint  partition  Ar  H  As  =  0,  Ai  U  •  •  •  U  An  =  V  must  also 
respect  the  given  set  of  labels. 


3  Total  Variation  and  a  Tight  Continuous  Relaxation 


We  derive  our  continuous  optimization  by  relaxing  the  set  energy  (P)  to  the  continuous  energy 


R 


£(*■)  =  £ 

r—l 


WfrWrv 

Wfr  -medx(/r)||liA' 


(2) 


Here  F  :=  [/i, . . .  Jr]  £  Matx-rGOi  1])  denotes  the  N  x  R  matrix  that  contains  in  its  columns  the 
relaxed  optimization  variables  associated  to  the  R  clusters.  A  few  definitions  will  help  clarify  the 
meaning  of  this  formula.  The  total  variation  ||/||tv  °f  a  vertex  function  /  :  V  — >  M  is  defined  by 


\\f\Wv  =  £«%|/(xi)  -  /(Xj) |.  (3) 

i= 1 


Alternatively,  if  we  view  a  vertex  function  /  as  a  vector  (/(x i), . . . ,  /(xjv))*  £  Rw  then  we  can 
write 


\\f\\Tv  ■■=  \\Kf\U. 


(4) 
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Here  K  £  M Mxiv(K)  denotes  the  gradient  matrix  of  a  graph  with  M  edges  and  N  vertices.  Each 
row  of  I\  corresponds  to  an  edge  and  each  column  corresponds  to  a  vertex.  For  any  edge  (i,j)  in 
the  graph  the  corresponding  row  in  the  matrix  K  has  an  entry  wrj  in  the  column  corresponding  to 
the  ith  vertex,  an  entry  in  the  column  corresponding  to  the  jth  vertex  and  zeros  otherwise. 

To  make  sense  of  the  remainder  of  (2)  we  must  introduce  the  asymmetric  £  '  -norm.  This  variant  of 
the  classical  f1-norm  gives  different  weights  to  positive  and  negative  values: 

1/ I  l.A  =  E  /(X')Ia  where  1  a  =  lX't  (5) 

i= 1 

Finally  we  define  the  A-median  (or  quantile),  denoted  medA(/),  as: 

medA(/)  =  the  (k  +  l)st  largest  value  in  the  range  of  /,  where  k  =  |_iV/ (A  +  1)J .  (6) 

These  definitions,  as  well  as  the  relaxation  (2)  itself,  were  motivated  by  the  following  theorem.  Its 
proof,  in  the  supplementary  material,  relies  only  the  three  preceding  definitions  and  some  simple 
algebra. 

Theorem  1.  Iff  =  1  A  is  the  indicator  function  of  a  subset  A  C  V  then 

WfWrv  =  2  Cut (A,AC) 

||/  -  medA(/)||  1A  min{A|A|, \AC\}  ' 


The  preceding  theorem  allows  us  to  restate  the  original  set  optimization  problem  (P)  in  the  equivalent 
discrete  form 


Minimize 


I  .fr  1 1  TV 

||/r--medA(/r)||1A 


over  non-zero  functions  /i, . . . ,  /r  :  V  — >  {0, 1}  such  that  /i  +  . . .  +  /r  =  ly. 


(P’) 


Indeed,  since  the  non-zero  functions  fr  can  take  only  two  values,  zero  or  one,  they  must  define  indi¬ 
cator  functions  of  some  nonempty  set.  The  simplex  constraint  /i  +  . . .  +  /r  =  ly  then  guarantees 
that  the  sets  Ar  :=  {xj  £  V  :  /r(x,)  =  1}  form  a  partition  of  the  vertex  set.  We  obtain  the  relaxed 
version  (P-rlx)  of  (P’)  in  the  usual  manner  by  allowing  fr  £  [0, 1]  to  have  a  continuous  range.  This 
yields 


Minimize 


I  LA- 1 1  TV 

||/r-medA(/r)||1)A 


over  functions  /i, . . . ,  /r  :  V  —>  [0, 1]  such  that  /i  +  . . .  +  /r  =  ly. 


(P-rlx) 


The  following  two  points  form  the  foundation  on  which  total  variation  clustering  relies: 

1  —  As  the  next  subsection  details,  the  total  variation  terms  give  rise  to  quasi-indicator  functions. 
That  is,  the  relaxed  solutions  [f1, . . . ,  /r]  of  (P-rlx)  mostly  take  values  near  zero  or  one  and  exhibit 
a  sharp,  non-smooth  transition  between  these  two  regions.  Since  these  quasi-indicator  functions  es¬ 
sentially  take  values  in  the  discrete  set  {0,1}  rather  than  the  continuous  interval  [0, 1],  solving  (P-rlx) 
is  almost  equivalent  to  solving  either  (P)  or  (P’).  In  other  words,  (P-rlx)  is  a  tight  relaxation  of  (P). 

2  —  Both  functions  /  i— >  ||/||xy  and  /  i— >•  \\f  —  medA(/)||ijA  are  convex.  The  simplex  constraint 
in  (P-rlx)  is  also  convex.  Therefore  solving  (P-rlx)  amounts  to  minimizing  a  sum  of  ratios  of  convex 
functions  with  convex  constraints.  As  the  next  section  details,  this  fact  allows  us  to  use  machinery 
from  convex  analysis  to  develop  an  efficient,  novel  algorithm  for  such  problems. 


3.1  The  Role  of  Total  Variation  in  the  Formation  of  Quasi-Indicator  Functions 


To  elucidate  the  precise  role  that  the  total  variation  plays  in  the  formation  of  quasi-indicator  func¬ 
tions,  it  proves  useful  to  consider  a  version  of  (P-rlx)  that  uses  a  spectral  relaxation  in  place  of  the 
total  variation: 


Minimize 


E 


r— 1 


1 1  /r  1 1  Lap 

\\fr  —  medA(/r)||  1A 


over  functions  /i, . . . ,  /r  :  V  — >  [0, 1]  such  that  /i  +  . . .  +  /r  =  ly 


(P-rlx2) 
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Here  1 1 / 1 1 Lap  =  ST=i  Wii  |/(xi)  —  /(xi)|2  denotes  the  spectral  relaxation  of  Cut(A,  Ac)\  it  equals 
(/,  Lf)  if  L  denotes  the  unnormalized  graph  Laplacian  matrix.  Thus  problem  (P-rlx2)  relates 
to  spectral  clustering  (and  therefore  NMF  [9])  with  a  positivity  constraint.  Note  that  the  only 
difference  between  (P-rlx2)  and  (P-rlx)  is  that  the  exponent  2  appears  in  ||  ■  ||Lap  while  the  ex¬ 
ponent  1  appears  in  the  total  variation.  This  simple  difference  of  exponent  has  an  important  conse¬ 
quence  for  the  tightness  of  the  relaxations.  Figure  1  presents  a  simple  example  that  illuminates  this 
difference.  If  we  bi-partition  the  depicted  graph,  i.e.  a  line  with  20  vertices  and  edge  weights 
vj,  ,;+1  =  1,  then  the  optimal  cut  lies  between  vertex  10  and  vertex  11  since  this  gives  a  per¬ 
fectly  balanced  cut.  Figure  1(a)  shows  the  vertex  function  /i  generated  by  (P-rlx)  while  figure 
1(b)  shows  the  one  generated  by  (P-rlx2).  Observe  that  the  solution  of  the  total  variation  model 
coincides  with  the  indicator  function  of  the  desired  cut  whereas  the  the  spectral  model  prefers  its 
smoothed  version.  Note  that  both  functions  in  figure  la)  and  lb)  have  exactly  the  same  total  vari¬ 
ation  II/IItv  =  |/(xi)  -  f  (x2 ) |  + - P  |/(xi9)  -  f  (x20) |  =  /(x i)  -  /(x20)  =  1  since  both 

functions  are  monotonic.  The  total  variation  model  will  therefore  prefer  the  sharp  indicator  function 
since  it  differs  more  from  its  A-median  than  the  smooth  indicator  function.  Indeed,  the  denominator 
||  fr  —  medA(/r)||i,A  is  larger  for  the  sharp  indicator  function  than  for  the  smooth  one.  A  differ¬ 
ent  scenario  occurs  when  we  replace  the  exponent  one  in  ||  •  \\rv  by  an  exponent  two,  however.  As 

JI/llLp  =  I/(xi)-/(x2)|2H - P  |/(xig)-/(x2o)|2  and  t2  <  t  when  f  <  1  it  follows  that  ||/||Lap 

is  much  smaller  for  the  smooth  function  than  for  the  sharp  one.  Thus  the  spectral  model  will  prefer 
the  smooth  indicator  function  despite  the  fact  that  it  differs  less  from  its  A-median.  We  therefore 
recognize  the  total  variation  as  the  driving  force  behind  the  formation  of  sharp  indicator  functions. 

(2M1MD— (5MIMD— ©— ®— ■ d>-®— ^ #— < 
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(a)  (b) 

Figure  1:  Top:  The  graph  used  for  both  relaxations.  Bottom  left:  the  solution  given  by  the  total 
variation  relaxation.  Bottom  right:  the  solution  given  by  the  spectral  relaxation.  Position  along  the 
.x'-axis  =  vertex  number,  height  along  the  ?/-axis  =  value  of  the  vertex  function. 


This  heuristic  explanation  on  a  simple,  two-class  example  generalizes  to  the  multiclass  case  and 
to  real  data  sets  (see  figure  2).  In  simple  terms,  quasi-indicator  functions  arise  due  to  the  fact  that 
the  total  variation  of  a  sharp  indicator  function  equals  the  total  variation  of  a  smoothed  version  of 
the  same  indicator  function.  The  denominator  \\fr  —  medx(/r)||i.A  then  measures  the  deviation  of 
these  functions  from  their  A-median.  A  sharp  indicator  function  deviates  more  from  its  median  than 
does  its  smoothed  version  since  most  of  its  values  concentrate  around  zero  and  one.  The  energy 
is  therefore  much  smaller  for  a  sharp  indicator  function  than  for  a  smooth  indicator  function,  and 
consequently  the  total  variation  clustering  energy  always  prefers  sharp  indicator  functions  to  smooth 
ones.  For  bi-partitioning  problems  this  fact  is  well-known.  Several  previous  works  have  proven  that 
the  relaxation  is  exact  in  the  two-class  case;  that  is,  the  total  variation  solution  coincides  with  the 
solution  of  the  original  NP-hard  problem  [8,  18,  3,  5]. 

Figure  2  illustrates  the  result  of  the  difference  between  total  variation  and  NMF  relaxations  on  the 
data  set  OPTDIGITS,  which  contains  5620  images  of  handwritten  numerical  digits.  Figure  2(a) 
shows  the  quasi-indicator  function  f\  obtained,  before  thresholding,  by  our  MTV  algorithm  while 
2(b)  shows  the  function  fn  obtained  from  the  NMF  algorithm  of  [1],  We  extract  the  portion  of  each 
function  corresponding  to  the  digits  four  and  nine,  then  sort  and  plot  the  result.  The  MTV  relaxation 
leads  a  sharp  transition  between  the  fours  and  the  nines  while  the  NMF  relaxation  leads  to  a  smooth 
transition. 


4 


Figure  2:  Left:  Solution  /4  from  our  MTV  algorithm  (before  thresholding)  plotted  over  the  fours 
and  nines.  Right:  Solution  /4  from  LSD  [1]  plotted  over  the  fours  and  nines. 

3.2  Transductive  Framework 

From  a  modeling  point-of-view,  the  presence  of  transductive  labels  poses  no  additional  difficulty.  In 
addition  to  the  simplex  constraint 

FGE:=|F1eM7Vxfl([0,l]):/r(xl)>0!  ^/r(Xi)  =  l|  (7) 

required  for  unsupervised  clustering  we  also  impose  the  set  of  labels  as  a  hard  constraint.  If 
Li, . . . ,  Lr  denote  the  R  vertex  subsets  representing  the  labeled  points,  so  that  x,  G  Lr  means 
x,  belongs  to  class  r.  then  we  may  enforce  these  labels  by  restricting  F  to  lie  in  the  subset 

F  G  A  :=  {F  G  MtvX-r([0,  1])  :  Vr,  (/i(xj), . . . ,  fR(xi))  =  er  Vx,Gir}.  (8) 
Here  er  denotes  the  row  vector  containing  a  one  in  the  rth  location  and  zeros  elsewhere.  Our  model 
for  transductive  classification  then  aims  to  solve  the  problem 

R  II  f  II  1 

Minimize  — rrr  over  matrices  F  G  E  n  A.  >  (P-trans) 

^  ||/r-medA(/r)||1)A  J 

Note  that  E  n  A  also  defines  a  convex  set,  so  this  minimization  remains  a  sum  of  ratios  of  convex 
functions  subject  to  a  convex  constraint.  Transductive  classification  therefore  poses  no  additional 
algorithmic  difficulty,  either.  In  particular,  we  may  use  the  proximal  splitting  algorithm  detailed  in 
the  next  section  for  both  unsupervised  and  transductive  classification  tasks. 

4  Proximal  Splitting  Algorithm 

This  section  details  our  proximal  splitting  algorithm  for  finding  local  minimizers  of  a  sum  of  ratios 
of  convex  functions  subject  to  a  convex  constraint.  We  start  by  showing  in  the  first  subsection  that 
the  functions 

T(f)  ■■=  Wfhv  and  B(/):=||/-medA(/)l||liA  (9) 

involved  in  (P-rlx)  or  (P-trans)  are  indeed  convex.  We  also  give  an  explicit  formula  for  a  subdiffer¬ 
ential  of  B  since  our  proximal  splitting  algorithm  requires  this  in  explicit  form.  We  then  summarize 
a  few  properties  of  proximal  operators  before  presenting  the  algorithm. 

4.1  Convexity,  Subgradients  and  Proximal  Operators 

Recall  that  we  may  view  each  function  /  :  V  — >  R  as  a  vector  in  with  /(xj)  as  the  ith  component 
of  the  vector.  We  may  then  view  T  and  B  as  functions  from  R N  to  R.  The  next  theorem  states  that 
both  B  and  T  define  convex  functions  on  R;V  and  furnishes  an  element  v  G  dB(f)  by  means  of  an 
easily  computable  formula.  The  formula  for  the  subdifferential  generalizes  a  related  result  for  the 
symmetric  case  [11]  to  the  asymmetric  setting.  We  provide  its  proof  in  the  supplementary  material. 
Theorem  2.  The  functions  B  and  T  are  convex.  Moreover,  given  f  G  RN  the  vector  v  G  Rw 
defined  below  belongs  to  dB(f): 

( A  */7(xi)  >  medA(/)  ( n°  =  |{x*  G  V  :  /(x4)  =  medA(/)}| 

w(xi)  =  <  i/7(xi)  =  medA(/)  where  l  n~  =  |{x*  G  V  :  ffc)  <  medA(/)}| 

I  -1  iff(xi)  <  med A(/)  [n+  =  |{x4  G  V  :  /(Xi)  >  medA(/)}| 
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In  the  above  theorem  dB(f )  denotes  the  subdifferential  of  B  at  f  and  v  £  dB(f)  denotes  a  subgra¬ 
dient.  Given  a  convex  function  A  :  Rw  — »  R,  the  proximal  operator  of  A  is  defined  by 

Pro *A(g)  ■=  argmin  A(f)  +  \\\f  -  g\\22.  (10) 

/  6R«  z 

If  we  let  Sc  denote  the  barrier  function  of  the  convex  set  C,  that  is 

sc(f)  =  Oif  /  e  Cand  Sc(f)  =  +oo  \f  f  f  C,  (11) 

then  we  easily  see  that  prox(5c,  is  simply  the  least-squares  projection  on  C,  in  other  words, 

prox^  (/)  =  projc(/)  :=  argmin  b\\f  —  g H^-  In  this  manner  the  proximal  operator  defines  a 

a'  c 

mapping  from  R^  to  Rw  that  generalizes  the  least-squares  projection  onto  a  convex  set. 


4.2  The  Algorithm 


We  can  rewrite  the  problem  (P-rlx)  or  (P-trans)  as 

Ft. 

Minimize  Sc{F )  +  E  E(fr)  over  all  matrices  F  =  [/i,...,/r]  sMjvxh  (12) 

r=  1 

where  E(fr)  =  T(fr)/ B(fr)  denotes  the  energy  of  the  quasi-indicator  function  of  the  rth  cluster. 
The  set  C  =  E  or  C  =  E  n  A  is  the  convex  subset  of  M^xi?  that  encodes  the  simplex  constraint 
(7)  or  the  simplex  constraint  with  labels.  The  corresponding  function  Sc{F),  defined  in  (11),  is 
the  barrier  function  of  the  desired  set.  Beginning  from  an  initial  iterate  F°  £  C  we  propose  the 
following  proximal  splitting  algorithm: 

Fk+1  :=  pro  xTk+Sc(Fk  +  8Bk(Fk)).  (13) 

Here  Tk{F)  and  Bk(F)  denote  the  convex  functions 


Tk{F)  :=J2ckT(fr) 

r= 1 


Bk{F)  :=^drfcF(/r), 

r=  1 


the  constants  (ck.  dk)  are  computed  using  the  previous  iterate 

ck  =  A  k/B(fk)  and  dk  =  A  kE(fk)/B(fk) 


and  Ak  denotes  the  timestep  for  the  current  iteration.  This  choice  of  the  constants  ( ck,dk )  yields 
Bk(Fk)  =  Fk(Fk),  and  this  fundamental  property  allows  us  to  derive  (see  supplementary  material) 
the  energy  descent  estimate: 

Theorem  3  (Estimate  of  the  energy  descent).  Each  of  the  Fk  belongs  to  C,  and  if  Bk  f  0  then 


R 


E 


Bk+ 1 

r 

Bk 

r 


{Ek  -  Ek+1)  > 


||^fe  _  _pfc+l||2 

Afc 


(14) 


where  Bk,Ek  stand  for  B(fk),  E(fk). 


Inequality  (14)  states  that  the  energies  of  the  quasi-indicator  functions  (as  a  weighted  sum)  decrease 
at  every  step  of  the  algorithm.  It  also  gives  a  lower  bound  for  how  much  these  energies  decrease.  As 
the  algorithm  progress  and  the  iterates  stabilize  the  ratio  Bk+1/Bk  converges  to  1,  in  which  case 
the  sum,  rather  than  a  weighted  sum,  of  the  individual  cluster  energies  decreases. 

Our  proximal  splitting  algorithm  (13)  requires  two  steps.  The  first  step  requires  computing  Gk  = 
Fk  +  dBk{Fk),  and  this  is  straightforward  since  theorem  2  provides  the  subdifferential  of  B,  and 
therefore  of  Bk ,  through  an  explicit  formula.  The  second  step  requires  computing  prox7-fc+(5c,  (Gfc), 
which  seems  daunting  at  a  first  glance.  Fortunately,  minimization  problems  of  this  form  play  an 
important  role  in  the  image  processing  literature.  Recent  years  have  therefore  produced  several  fast 
and  accurate  algorithms  for  computing  the  proximal  operator  of  the  total  variation.  As  Tk  +  Sc  con¬ 
sists  of  a  weighted  sum  of  total  variation  terms  subject  to  a  convex  constraint,  we  can  readily  adapt 
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these  algorithms  to  compute  the  second  step  of  our  algorithm  efficiently.  In  this  work  we  use  the 
primal-dual  algorithm  of  [6]  with  acceleration.  This  relies  on  a  proper  uniformly  convex  formulation 
of  the  proximal  minimization,  which  we  detail  completely  in  the  supplementary  material. 

The  primal-dual  algorithm  we  use  to  compute  prox7-fc+(5c,  ( Gk )  produces  a  sequence  of  approximate 
solutions  by  means  of  an  iterative  procedure.  A  stopping  criterion  is  therefore  needed  to  indicate 
when  the  current  iterate  approximates  the  actual  solution  prox-p.^.  ( Gk )  sufficiently.  Ideally,  we 
would  like  to  terminate  Fk+1  sa  prox7-fc+(5c  (Gfe)  in  such  a  manner  so  that  the  energy  descent 
property  (14)  still  holds  and  Fk+1  always  satisfies  the  required  constraints.  In  theory  we  cannot 
guarantee  that  the  energy  estimate  holds  for  an  inexact  solution.  We  may  note,  however,  that  a 
slightly  weaker  version  of  the  energy  estimate  (14) 


E 


r=  1 


Bk+ 1 

r 

Bk 

r 


(. Ek  -  Ek+1 )  >  (1 


_  pk+l 

Ak 


2 


(15) 


holds  after  a  finite  number  of  iterations  of  the  inner  minimization.  Moreover,  this  weaker  version 
still  guarantees  that  the  energies  of  the  quasi-indicator  functions  decrease  as  a  weighted  sum  in 
exactly  the  same  manner  as  before.  In  this  way  we  can  terminate  the  inner  loop  adaptively:  we  solve 
pk+i  K  prox-j-fe^.,  ( Gk )  less  precisely  when  Fk+1  lies  far  from  a  minimum  and  more  precisely  as 
the  sequence  { Fk }  progresses.  This  leads  to  a  substantial  increase  in  efficiency  of  the  full  algorithm. 

Our  implementation  of  the  proximal  splitting  algorithm  also  guarantees  that  Fk+1  always  satisfies 
the  required  constraints.  We  accomplish  this  task  by  implementing  the  primal-dual  algorithm  in 
such  a  way  that  each  inner  iteration  always  satisfies  the  constraints.  This  requires  computing  the 
projection  projc  (F)  exactly  at  each  inner  iteration.  The  overall  algorithm  remains  efficient  provided 
we  can  compute  this  projection  quickly.  When  C  =  E  the  algorithm  [14]  performs  the  required 
projection  in  at  most  R  steps.  When  C  =  En  A  the  computational  effort  actually  decreases,  since  in 
this  case  the  projection  consists  of  a  simplex  projection  on  the  unlabeled  points  and  straightforward 
assignment  on  the  labeled  points.  Overall,  each  iteration  of  the  algorithm  scales  like  0(NR2)  + 
O(MR)  +  0(RN  log(iV))  for  the  simplex  projection,  application  of  the  gradient  matrix  and  the 
computation  of  the  balance  terms,  respectively. 

We  may  now  summarize  the  full  algorithm,  including  the  proximal  operator  computation.  In  prac¬ 
tice  we  find  the  choices  Ak  =  max{Bk, . . . ,  B1^}  and  any  small  e  work  well,  so  we  present  the 
algorithm  with  these  choices.  Recall  the  matrix  K  in  (4)  denotes  the  gradient  matrix  of  the  graph. 


Algorithm  1  Proximal  Splitting  Algorithm 


Input:  FeC,P  =  0,L=  \\K\\2,t  =  L~1,e  =  1CT3 
while  loop  not  converged  do 

// Perform  outer  step  Gk  =  Fk  +  OBk(Fk ) 

A  =  maxr  B(fr)  Aq  =  minr  B(fr)  o  =  Aq(tA2L2)_1  F  =  F 


Ejf  i) 
B(h) 


EUr) 

BUr) 


Db  =  diag 


A 


• ,  dB(fR)\DE  (using  theorem  2) 


B(fR ) 


De  =  diag 

V  =  A[dB(f1),. 

G  =  F  +  V 

II Perform  Fk+1  «  proX'j-k+$c(Gk)  until  energy  estimate  holds 

while  (15)  fails  do 

P  =  P  +  gKFDb  P  =  P/  max{|P|,  1}  (both  operations  entriwise)  F0id  =  F 

F  =  F  -tIPPDb  F  =  (F  +  tG)/{1  +  t)  F  =  projc(F) 
e  =  l/Vl  +  2 T  t  =  0t  o  =  <j/0  F  =  (1  +  9)F  -  9Fold 

end  while 
end  while 


5  Numerical  Experiments 

We  now  demonstrate  the  MTV  algorithm  for  unsupervised  and  transductive  clustering  tasks.  We 
selected  six  standard,  large-scale  data  sets  as  a  basis  of  comparison.  We  obtained  the  first  data  set 
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(4MOONS)  and  its  similarity  matrix  from  [4]  and  the  remaining  five  data  sets  and  matrices  (WE- 
BKB4,  OPTDIGITS,  PENDIGITS,  20NEWS,  MNIST)  from  [19].  The  4MOONS  data  set  contains 
4K  points  while  the  remaining  five  contain  4.2K,  5.6K,  1  IK,  20K  and  70K  points,  respectively. 

Our  first  set  of  experiments  compares  our  MTV  algorithm  against  other  unsupervised  approaches. 
We  compare  against  two  previous  total  variation  algorithms  [11,  3],  which  rely  on  recursive  bi¬ 
partitioning,  and  two  top  NMF  algorithms  [1,  19].  We  use  the  normalized  Cheeger  cut  versions 
of  [11]  and  [3]  with  default  parameters.  We  used  the  code  available  from  [19]  to  test  each  NMF 
algorithm.  All  non-recursive  algorithms  (LSD  [1],  NMFR  [19],  MTV)  received  two  types  of  initial 
data:  (a)  the  deterministic  data  used  in  [19];  (b)  a  random  procedure  leveraging  normalized-cut 
[16],  Procedure  (b)  first  selects  one  data  point  uniformly  at  random  from  each  computed  NCut 
cluster,  then  sets  fr  equal  to  one  at  the  data  point  drawn  from  the  rth  cluster  and  zero  otherwise. 
We  then  propagate  this  initial  stage  by  replacing  each  fr  with  (I  +  L)-1/r  where  L  denotes  the 
unnormalized  graph  Laplacian.  Finally,  to  aid  the  NMF  algorithms,  we  add  a  small  constant  0.2  to 
the  result  (each  performed  better  than  without  adding  this  constant).  For  MTV  we  use  (a)  and  30 
random  trials  of  (b)  then  report  the  cluster  purity  (c.f.  [19]  for  a  definition  of  purity)  of  the  solution 
with  the  lowest  discrete  energy  (P).  We  then  use  each  NMF  with  exactly  the  same  initial  conditions 
and  report  simply  the  highest  purity  achieved  over  all  31  runs.  This  biases  the  results  in  favor  of  the 
NMF  algorithms.  Due  to  the  non-convex  nature  of  these  algorithms,  the  random  initialization  gave 
the  best  results  and  significantly  improved  upon  previously  reported  results  of  LSD  in  particular.  We 
allowed  each  non-recursive  algorithm  10000  iterations  using  initial  condition  (a)  while  each  trial  of 
(b)  performed  2000  iterations.  The  following  table  reports  the  results.  Our  next  set  of  experiments 


Alg/Data 

4MOONS 

WEBKB4 

OPTDIGITS 

PENDIGITS 

20NEWS 

MNIST 

NCC-TV  [3] 

88.75 

51.76 

95.91 

73.25 

23.20 

88.80 

1SPEC  [11] 

73.92 

39.68 

88.65 

82.42 

11.49 

88.17 

LSD  [1] 

99.40 

54.50 

97.94 

88.44 

41.25 

95.67 

NMFR  [19] 

77.80 

64.32 

97.92 

91.21 

63.93 

96.99 

MTV 

99.53 

59.15 

98.29 

89.06 

39.40 

97.60 

demonstrate  our  algorithm  in  a  transductive  setting.  For  each  data  set  we  randomly  sample  either 
one  label  per  class  or  a  percentage  of  labels  per  class  from  the  ground  truth.  We  then  run  ten  trials  of 
initial  condition  (b)  (propagating  all  labels  instead  of  one)  and  report  the  purity  of  the  lowest  energy 
solution  as  before  along  with  the  average  computational  time  (for  simple  MATLAB  code  running  on 
a  standard  desktop)  of  the  ten  runs.  We  terminate  the  algorithm  once  the  relative  change  in  energy 
falls  below  1 0 1  between  outer  steps  of  algorithm  1.  The  table  below  reports  the  results.  Note  that 
for  well-constructed  graphs  (such  as  MNIST),  our  algorithm  performs  remarkably  well  with  only 
one  label  per  class. 


Labels 

4MOONS 

WEBKB4 

OPTDIGITS 

PENDIGITS 

20NEWS 

MNIST 

1 

99.55/  3.0s 

56.58/  1.8s 

98.29/ 7s 

89.17/  14s 

50.07/  52s 

97.53/ 98s 

1% 

99.55/ 3.1s 

58.75/ 2.0s 

98.29/ 4s 

93.73/ 9s 

61.70/ 54s 

97.59/  54s 

2.5% 

99.55/  1.9s 

57.01/  1.7s 

98.35/ 3s 

95.83/ 7s 

67.61/ 42s 

97.72/  39s 

5% 

99.53/  1.2s 

58.34/  1.3s 

98.38/ 2s 

97.98/ 5s 

70.51/  32s 

97.79/ 31s 

10% 

99.55/  0.8s 

62.01/  1.2s 

98.45/ 2s 

98.22/ 4s 

73.97/ 25s 

98.05/  25s 

Our  non-recursive  MTV  algorithm  vastly  outperforms  the  two  previous  recursive  total  variation 
approaches  and  also  compares  well  with  state-of-the-art  NMF  approaches.  Each  of  MTV,  LSD  and 
NMFR  perform  well  on  manifold  data  sets  such  as  MNIST,  but  NMFR  tends  to  perform  best  on 
noisy,  non-manifold  data  sets.  This  results  from  the  fact  that  NMFR  uses  a  costly  graph  smoothing 
technique  while  our  algorithm  and  LSD  do  not.  We  plan  to  incorporate  such  improvements  into  the 
total  variation  framework  in  future  work.  Lastly,  we  found  procedure  (b)  can  help  overcome  the 
lack  of  convexity  inherent  in  many  clustering  approaches.  We  plan  to  pursue  a  more  principled  and 
efficient  initialization  along  these  lines  in  the  future  as  well.  Overall,  our  total  variation  framework 
presents  a  promising  alternative  to  NMF  methods  due  to  its  strong  mathematical  foundation  and 
tight  relaxation. 
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1  Proofs  of  Theorems 

Theorem  1.  Iff  =  1  A  is  the  indicator  function  of  a  subset  A  C  V  then 

WfWrv  =  2  Cut (A,AC) 

||/  -  medA(/)||  1A  min{A|A|,  |AC|} ' 

Proof.  The  fact  that  ||/||tv  =  2  Cut  (A,  Ac )  follows  directly  from  the  definition  of  the  total  varia¬ 
tion.  Indeed,  a  straightforward  computation  shows 

N  N 

\\f\Wv  =  -  /(xi)l  +  H  X!mdl/(xj)l=  zL  Wii+ 

x.i£A  j= 1  Xi£Ac  j= 1  Xi£AcXj£A 

Thus  H/Htv  =  2  Cut(A,  Ac)  as  W  is  symmetric.  Let  B(f)  :=  \\f  —  medA(/)||1  A.  To  show  that 
B(f)  =  min  {A|A|,  |AC|},  suppose  first  that  A|A|  <  \AC\.  This  inequality  implies  A |A|  <N  —  |A|, 
or  equivalently  that  \A\  <  N/(  1  +  A).  Thus  |A|  <  k  :=  |_iV/(l  +  A)J ,  and  since  f  =  1a  for 
|A|  <  k  it  follows  immediately  that  the  (k  +  l)st  largest  entry  in  the  range  of  f  equals  zero.  Thus 
medA(/)  =  0  by  definition.  A  direct  computation  then  yields  that  B(f)  =  l/(xi)|A  =  A|A|. 

In  the  converse  case,  the  fact  that  \AC\  <  A|A|  implies  |A|  >  N/(  1  +  A)  >  k.  Thus  \A\  >  k  +  1 
and  medA(/)  =  1.  Direct  computation  then  shows  that  B(f)  =  Yhie-v  l/(xi)  -  1|a  =  \AC\  as 
claimed.  □ 

Lemma  1.  Let  h  £  and  suppose  v  £  M.N  satisfies 

(  A  ifh(xi)  >  0 

u(x»)  £  <  [— 1,  A]  ifhfxi)  =  0  (1) 

[-1  ifhfc)  <  0. 

Then  v  £  d\\h\\ ijA. 

Proof.  Note  that  |/i(xj)|A  =  u(x,)ft,(xi)  for  each  x*,  so  that  for  arbitrary  g  £  and  each  x,  the 
inequality 

|<?(xi)|A  -  |/i(xj)|A  >  t>(x»)  (ff(xi)  -  h(xi)) 

holds.  Summing  both  sides  over  all  x$  £  V  then  gives  the  claim.  □ 


1 


Theorem  2.  The  functions  B  and  T  are  convex.  Moreover,  given  f  £  the  vector  v  £ 
defined  by 


fA 

=  j  n  Jn+ 

belongs  to  dB(f). 


if  /(xi)  >  medA(/)  f  n°  =  |{x,;  £  V  :  /(x,)  =  medA(/)}| 

*//(xi)  =  medA(/)  w/zere  <  n"  =  |{x,;  £  V  :  f(xf)  <  medA(/)}| 
if  fix,)  <  med  A(/)  U+  =  |{x»  €  V  :  /(x»)  >  medA(/)}| 


Proof  The  convexity  of  T(f)  follows  directly  from  its  definition  and  a  straightforward  computation 
using  the  definition  of  convexity.  Due  to  the  continuity  B(f ),  to  show  convexity  it  suffices  to 
establish  the  existence  of  a  subdifferential  at  every  point. 

To  this  end  note  that  medA(/)  £  rang e(/),  so  that  in  particular  n°  >  1  by  definition.  Let  1  < 
k  :=  [N/{1  +  A)J  <  N  denote  that  entry  of  /  so  that  /(x*;)  =  medA(/).  By  definition  of 
medA(/)  there  exist  at  most  k  elements  of  /  larger  than  medA(/),  so  that  n+  <  k  <  N/(  1  +  A). 
As  N  =  n~  +  n°  +  n+  this  implies  *n+nan  <  1.  Similarly  there  exist  at  most  N  —  (k  +  1) 
elements  of  /  smaller  than  medA(/),  so  that  n~  <N  —  (k  +  l)<N  —  N/  (1  +  A).  The  fact  that 
N  =  +  ?r°  +  n+  then  implies  ”  fan+  <  A.  Combining  this  with  the  previous  inequality  yields 

-1  <  <  A. 

—  nu  — 

Put  h  :=  /  —  medA(/)l,  and  note  that  the  vector  v  defined  above  satisfies  v  £  9||/r||ijA  by  the 
preceeding  lemma.  Thus  for  any  g  £  it  holds  that 

llfl  -  medA(p)l||ijA  -  ||/  -  medA(/)l||i,A  >  (v,  g  -  f  +  (medA(/)  -  medA(^))l) 
by  definition  of  the  subdifferential.  Note  also  that  (v,  1)  =  0,  so  that  in  fact 

B{g)  -  B(f)  =  || g  -  medA(ff)l||i,A  —  ||/  —  medA(/)l||i>A  >{v,g-  f) 

for  g  £  Rw  arbitrary.  Thus  v  £  dB(f)  by  definition  of  the  subdifferential.  In  particular  dB{f )  is 
always  non-empty,  so  B{f)  is  convex.  □ 


Theorem  3  (Estimate  of  the  energy  descent).  Each  of  the  Fk  belongs  to  C ,  and  if  if:  f  0  then 


Bk+ 


£  -p-  W  -  £f+I)  > 


where  stand  for  E(f*). 


r=  1 
k\ 


|| pk  _  pk+ 1 II 2 


(2) 


Proof.  Let  Vk  £  dBk{Fk).  Then  by  definition  of  the  subdifferential  it  follows  that 

Bk(Fk+l)  >  Bk(Fk)  +  {Fk+1  -  Fk,  Vk ).  (3) 

As  Fk+1  =  pro X']-k+6c(Fk  +  Vk)  the  definition  of  the  proximal  operator  implies  that  Fk+1  £  C 
and  that  also 

Fk  +  Vk  -  Fk+1  £  d{Tk  +  5c)(Fk+1). 

The  definition  of  the  subdifferential  and  the  fact  that  5c{Fk)  =  Sc{Fk+1)  =  0  then  combine  to 
imply 

Tk(Fk)  >  Tk{Fk+1)  +  ( Fk  —  Fk+1,Fk  +  Vk  —  Fk+1) 

=  7 ~k(pk+1)  +  ||  Fk  —  Ffe+1||2  +  ( Fk  —  Fk+1,Vk)  (4) 


Adding  (3)  and  (4)  yields 

Fk(Fk)  +  Bk{Fk+ 1)  >  Fk(Fk+1)  +  Bk{Fk )  +  || Fk  —  Ffe+1||2 
or  equivalently  that  Bk{Fk+1)  >  Fk{Fk+1)  +  ||Ffc  —  Fk+1 1|2  since  Bk(Fk)  = 


Tk{Fk)  by  con¬ 


struction.  Expanding  this  last  inequality  shows 

(■ EkBk+1  -  Tk+1)  >|| Fk-  Fk+1 

r= 1 

which  yields  the  claim  after  by  Bk+1  in  each  term  of  the  summation. 
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□ 
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2  Primal-Dual  Formulation 


Consider  the  minimization 

Fk+1  :=pro  xrfc+5c(Gfc). 

We  may  write  this  as  the  saddle-point  problem 

min  max  Cp,ICu)  +  G(u )  —  F*(p ). 

.U6RJVK  pgRMH 


Here  the  vector  u  =  (/i, . . . ,  jf)1  is  a  “vectorized”  version  of  F  and  the  matrix  JC  denotes  the  block 
diagonal  matrix 


,'Ak  Ak 

K  :=  blkdiag  (  -^K, . . . ,  -^K 


Bf 


R 


where  K  is  the  gradient  matrix  of  the  graph.  We  define  the  convex  function  G(u)  as 


G{u)  :=^f;il/r-^||2  +  M«), 

r—1 

where  Sq  denotes  the  barrier  function  of  the  convex  set  C  (either  the  simplex  or  simplex  with  labels) 
as  before.  The  convex  function  F*  ( p )  denotes  the  barrier  function  of  the  l°°  unit  ball,  so  that 


F*(p) 


0  if  \pi\  <1  Vl<i<  MR 
+oo  otherwise. 


Note  also  that  G(u)  is  uniformly  convex,  in  that  if  v  €  dG(u)  denotes  any  subdifferential  then  for 
any  u!  £  R™  the  inequality 

G(u')  —  G(u)  >  ( v,u'  —  u)  +  —\\u  —  u,||2 

holds.  We  may  therefore  apply  algorithm  2  of  [1]  with  7=1  with  to  solve  the  saddle -point  problem. 
This  algorithm  consists  in  the  iterations 

pn+1  =  proxCT„F,  {pn  +  anK.un ) 
un+l  =  pr0XrnG(Mn  _  TnK}pn+1) 

eu  =  1  Tn+1  =  enTn  ^n+1  = 

x/1  +  2  Tn 

un+1  =  un+1  +  en{un+1  -  un) 

and  converges  provided  the  inequality  a0  <  (r°|  |/C|  H)-1  holds  for  the  initial  timesteps.  We  may 
compute  the  inner  proximal  operators  analytically  to  find 

(prox^n  p*(z))i  =  Zi/ max{  1 ,  |  zt\}  Vl<i<  MR , 

and  by  completing  the  square  that 

proxT„G(2;)  =  projc 

where  g  =  (gk , . . . ,  denotes  Gk  in  vectorized  form.  The  inner  loop  of  algorithm  1  then  follows 

by  re-writing  these  computations  in  matrix  form. 


fz  +  Tng\ 

V  1  +  Tn  )  ’ 
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